STAM101 :: Lecture 12 :: Correlation – definition – Scatter diagram – Pearson’s correlation coefficient – properties of correlation coefficient

Correlation
Correlation is the study of the relationship between two or more variables. In most experiments we gather information on several related variables. The joint distribution of two related variables is called a bivariate distribution, and that of more than two variables a multivariate distribution; when the variables are jointly normal, these are the bivariate normal and multivariate normal distributions.
In the case of a bivariate or multivariate normal distribution, we are interested in discovering and measuring the magnitude and direction of the relationship between two or more variables. The tool used for this is correlation.
Suppose X and Y are two continuous variables. If a change in X is accompanied by a systematic change in Y, the variables are said to be correlated. In other words, a systematic relationship between the variables is termed correlation. When only two variables are involved, the correlation is known as simple correlation; when more than two variables are involved, it is known as multiple correlation. When the variables move in the same direction they are said to be positively correlated, and when they move in opposite directions they are said to be negatively correlated.

 
Scatter Diagram

To investigate whether there is any relation between the variables X and Y we use a scatter diagram. Let (x1, y1), (x2, y2), …, (xn, yn) be n pairs of observations. If X and Y are plotted along the x-axis and y-axis respectively in the x-y plane of a graph sheet, the resulting diagram of dots is known as a scatter diagram. From the scatter diagram we can judge whether x and y are correlated, whether the correlation is positive or negative, and whether it is linear or curvilinear.

 

                                                 
        

[Figure: four scatter diagrams showing positive correlation, negative correlation, curvilinear (non-linear) correlation, and no correlation]

 
Pearson’s Correlation Coefficient

The measure of the degree of relationship between two continuous variables is called the correlation coefficient. It is denoted by r for a sample and by ρ for a population. The correlation coefficient r is known as Pearson’s correlation coefficient, after Karl Pearson, who developed it. It is also called the product moment correlation coefficient.

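The correlation coefficient r is the ratio of the covariance of X and Y to the product of the standard deviations of X and Y.
Symbolically,

\[ r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n}\sum (x - \bar{x})^2} \, \sqrt{\frac{1}{n}\sum (y - \bar{y})^2}} \]

which can be simplified as

\[ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}} \]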

The numerator is called the sum of products of X and Y, abbreviated SP(XY). In the denominator, the first term is called the sum of squares of X, i.e. SS(X), and the second term is called the sum of squares of Y, i.e. SS(Y), so that r = SP(XY) / √(SS(X) · SS(Y)).
The denominator in the above formula is always positive. The numerator may be positive or negative, so r may be either positive or negative.
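The SP(XY), SS(X) and SS(Y) terms can be computed directly from the raw sums. The following is a minimal sketch, assuming the usual computational form r = SP(XY) / √(SS(X) · SS(Y)); the function name `pearson_r` and the small data set are illustrative and not part of the lecture.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from the sum of products and the sums of squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two observations")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # SP(XY) = sum(xy) - (sum x)(sum y) / n
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    # SS(X) = sum(x^2) - (sum x)^2 / n, and likewise for Y
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sp_xy / sqrt(ss_x * ss_y)

# Hypothetical data in which y tends to rise with x, so r should be positive.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```

Here SP(XY) = 6, SS(X) = 10 and SS(Y) = 6, so r = 6 / √60. The sign of r comes entirely from SP(XY), as noted above.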

Assumptions in correlation analysis:
The correlation coefficient r is used under the following assumptions:

  1. The variables under study are continuous random variables and are normally distributed.
  2. The relationship between the variables is linear.
  3. Each pair of observations is independent of every other pair.
 
Properties
  1. The correlation coefficient value ranges between –1 and +1.
  2. The correlation coefficient is not affected by change of origin or scale or both.
  3. If r > 0, the correlation between the two variables x and y is positive.
     If r < 0, the correlation between x and y is negative.
     If r = 0, x and y are not linearly correlated. This alone does not mean the variables are independent; that conclusion holds when x and y are jointly normally distributed.
     If r = +1, the correlation is perfect positive.
     If r = -1, the correlation is perfect negative.
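These properties can be checked numerically. The sketch below is illustrative only: the helper `pearson_r` and all data are assumptions made for the demonstration. It uses positive scale factors for the change of scale; if the two scale factors had opposite signs, the magnitude of r would be unchanged but its sign would flip.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / sqrt(ss_x * ss_y)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_original = pearson_r(x, y)

# Change of origin and scale: u = (x - 3) / 2, v = 10y + 7.
u = [(a - 3) / 2 for a in x]
v = [10 * b + 7 for b in y]
r_transformed = pearson_r(u, v)

# Exact linear relations give the limiting values +1 and -1.
r_perfect_pos = pearson_r(x, [3 * a + 1 for a in x])
r_perfect_neg = pearson_r(x, [-2 * a + 4 for a in x])

# y = x^2 on values symmetric about zero: y depends on x, yet r is 0,
# because r measures only the linear part of a relationship.
xs = [-2, -1, 0, 1, 2]
r_curved = pearson_r(xs, [a * a for a in xs])

print(r_original, r_transformed)
print(r_perfect_pos, r_perfect_neg, r_curved)
```

The first two values agree, which is the invariance to origin and scale; the last value shows why a scatter diagram is worth inspecting before r = 0 is read as "no relationship".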

Testing the significance of r
The significance of r can be tested by Student’s t test. The test statistic is given by

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]

Under the null hypothesis of no correlation, this t follows Student’s t distribution with (n-2) degrees of freedom.
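As a small sketch of the test, assuming the usual form of the statistic, t = r√(n − 2) / √(1 − r²): the function name `t_statistic_for_r` and the values r = 0.60, n = 27 are hypothetical, and the critical value 2.060 is the two-tailed 5% entry for 25 degrees of freedom from a standard t table.

```python
from math import sqrt

def t_statistic_for_r(r, n):
    """t statistic for testing whether a sample correlation differs from zero."""
    if n <= 2:
        raise ValueError("at least three pairs are needed (n - 2 degrees of freedom)")
    if not -1 < r < 1:
        raise ValueError("the statistic is undefined for |r| >= 1")
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Hypothetical sample: r = 0.60 computed from n = 27 pairs.
r, n = 0.60, 27
t = t_statistic_for_r(r, n)
df = n - 2

# Two-tailed 5% critical value for 25 df, taken from a standard t table.
t_tab = 2.060
significant = abs(t) > t_tab
print(f"t = {t:.3f} on {df} df, significant at 5%: {significant}")
```

Comparing |t| with the tabulated value treats the test as two-tailed, which matches a null hypothesis of zero correlation against an alternative of non-zero correlation.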
The relationship between the variables is interpreted through the square of the correlation coefficient (r²), which is called the coefficient of determination. The value 1 - r² is called the coefficient of alienation. If r² is 0.72, it implies that, on the basis of the sample, 72% of the variation in one variable is explained by the variation in the other variable. The coefficient of determination is also used to compare two correlation coefficients.

Problem
Compute Pearson’s coefficient of correlation between plant height (cm) and yield (kg) from the data given below:

Plant height (cm):   39   65   62   90   82   75   25   98   36   78
Yield (kg):          47   53   58   86   62   68   60   91   51   84

Solution
H0: The population correlation coefficient is zero (r is not significant).
H1: The population correlation coefficient is not zero (r is significant).
Level of significance: 5%
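From the data,
n = 10, Σx = 650, Σy = 660, Σx² = 47648, Σy² = 45784, Σxy = 45604

SP(XY) = 45604 - (650 × 660)/10 = 2704
SS(X) = 47648 - (650)²/10 = 5398
SS(Y) = 45784 - (660)²/10 = 2224

\[ r = \frac{2704}{\sqrt{5398 \times 2224}} = 0.7804 \]

Since r is positive, plant height and yield are positively correlated.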
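Test statistic

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.7804\sqrt{8}}{\sqrt{1-(0.7804)^2}} = 3.530 \]

ttab = t(10-2, 5% level of significance) = 2.306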

Inference
Since t > ttab, we reject the null hypothesis.
Therefore the correlation coefficient r is significant, i.e. there is a relation between plant height and yield.
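The calculation for this problem can be cross-checked with a short script. This is a sketch using only the data and the tabulated value 2.306 given in the problem; the variable names are illustrative.

```python
from math import sqrt

# Data from the problem statement.
height = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # plant height (cm)
yield_kg = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]  # yield (kg)

n = len(height)
sum_x, sum_y = sum(height), sum(yield_kg)

# Sum of products and sums of squares in their computational forms.
sp_xy = sum(x * y for x, y in zip(height, yield_kg)) - sum_x * sum_y / n
ss_x = sum(x * x for x in height) - sum_x ** 2 / n
ss_y = sum(y * y for y in yield_kg) - sum_y ** 2 / n

r = sp_xy / sqrt(ss_x * ss_y)

# t statistic with n - 2 degrees of freedom.
t = r * sqrt(n - 2) / sqrt(1 - r * r)

# Tabulated t for 8 df at the 5% level, as quoted in the problem.
t_tab = 2.306
reject_h0 = abs(t) > t_tab

print(f"SP(XY) = {sp_xy:.0f}, SS(X) = {ss_x:.0f}, SS(Y) = {ss_y:.0f}")
print(f"r = {r:.4f}, t = {t:.3f}, reject H0: {reject_h0}")
```

With these data the script gives r of about 0.78 and t of about 3.53, which exceeds 2.306, so the null hypothesis is rejected.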
